
5.11 Post-Training Embedding Binarization for Fast Online Top-K Passage Matching

To lower the complexity of BERT, the recent state-of-the-art model ColBERT [113] employs a contextualized late interaction paradigm to independently learn fine-grained query-passage representations. It comprises (1) a query encoder $f_Q$, (2) a passage encoder $f_D$, and (3) a query-passage score predictor. Specifically, given a query $q$ and a passage $d$, $f_Q$ and $f_D$ encode them into bags of fixed-size embeddings $E_q$ and $E_d$ as follows:

$$
\begin{aligned}
E_q &= \mathrm{Normalize}\!\left(\mathrm{CNN}\!\left(\mathrm{BERT}(\text{``}[Q]\, q_0 q_1 \cdots q_l\text{''})\right)\right),\\
E_d &= \mathrm{Filter}\!\left(\mathrm{Normalize}\!\left(\mathrm{CNN}\!\left(\mathrm{BERT}(\text{``}[D]\, d_0 d_1 \cdots d_n\text{''})\right)\right)\right),
\end{aligned}
\tag{5.44}
$$

where $q$ and $d$ are tokenized into tokens $q_0 q_1 \cdots q_l$ and $d_0 d_1 \cdots d_n$ by the BERT-based WordPiece tokenizer, respectively. $[Q]$ and $[D]$ indicate the sequence types.
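The score predictor then operates on these embedding bags with ColBERT's late-interaction "MaxSim" operator: each query token embedding is matched against its most similar passage token embedding, and the per-token maxima are summed. Below is a minimal PyTorch sketch of this scoring step, with random unit-normalized vectors standing in for the BERT/CNN encoder outputs of Eq. (5.44); the sizes `l`, `n`, and `dim` are assumed for illustration.

```python
import torch
import torch.nn.functional as F

def late_interaction_score(E_q: torch.Tensor, E_d: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim: for each query token embedding (row of E_q),
    take the maximum similarity over all passage token embeddings (rows of
    E_d), then sum over query tokens. Rows are unit-normalized, so the dot
    products are cosine similarities."""
    sim = E_q @ E_d.T                   # (l, n) token-to-token similarities
    return sim.max(dim=1).values.sum()  # sum of per-query-token maxima

# Toy stand-ins for the encoder outputs of Eq. (5.44); sizes are assumed.
l, n, dim = 8, 120, 128
E_q = F.normalize(torch.randn(l, dim), dim=-1)   # Normalize(...)
E_d = F.normalize(torch.randn(n, dim), dim=-1)

print(late_interaction_score(E_q, E_d))
```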

Despite the advances of ColBERT over the vanilla BERT model, its massive computation and parameter burden still hinders deployment on edge devices. Recently, Chen et al. [40] proposed Bi-ColBERT, which binarizes the embeddings to relieve the computation burden. Bi-ColBERT involves (1) semantic diffusion, which hedges the information loss caused by embedding binarization, and (2) an approximation of the unit impulse function [18] for more accurate gradient estimation.
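The need for the second ingredient is that the derivative of $\mathrm{sign}(\cdot)$ is zero almost everywhere (an impulse at the origin), so exact gradients are useless for backpropagation. The sketch below illustrates the general mechanism with the common straight-through estimator (a clipped identity) as the surrogate gradient; this is a generic stand-in, not Bi-ColBERT's actual unit-impulse approximation.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """sign(x) in the forward pass; a clipped-identity surrogate gradient
    in the backward pass (straight-through estimator). Bi-ColBERT instead
    derives its surrogate from an approximation of the unit impulse."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass gradients through only where |x| <= 1.
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)

x = torch.randn(4, 8, requires_grad=True)
b = BinarizeSTE.apply(x)   # binarized embedding in {-1, +1}
b.sum().backward()         # gradients flow via the surrogate
```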

5.11.1 Semantic Diffusion

Binarization with $\mathrm{sign}(\cdot)$ inevitably smooths the embedding informativeness into the binarized space, i.e., $\{-1, 1\}^d$, regardless of the original values. Thus, intuitively, one wants to avoid condensing and gathering informative latent semantics in (relatively small) sub-structures of the embedding bags. In other words, the aim is to diffuse the embedded semantics across all embedding dimensions, as one effective strategy to hedge the inevitable information loss caused by the numerical binarization and to retain the semantic uniqueness after binarization as much as possible.
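To see this loss concretely, two embeddings with quite different coordinate magnitudes can collapse to the identical binary code. A tiny numpy illustration (the vectors are chosen arbitrarily):

```python
import numpy as np

u = np.array([0.9, 0.01, -0.2, 0.5])   # semantics concentrated in a few dims
v = np.array([0.1, 0.80, -0.9, 0.02])  # a quite different embedding

print(np.sign(u))   # [ 1.  1. -1.  1.]
print(np.sign(v))   # [ 1.  1. -1.  1.]  -> identical binary code
```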

Recall that in singular value decomposition (SVD), the singular values and singular vectors reconstruct the original matrix; normally, large singular values can be interpreted as being associated with the major semantic structures of the matrix [242]. To achieve semantic diffusion by normalizing the singular values, i.e., equalizing their respective contributions in constituting the latent semantics, the authors introduced a lightweight semantic diffusion technique as follows. Concretely, let $I$ denote the identity matrix and $p^{(h)} \in \mathbb{R}^d$ a standard normal random vector. During training, the diffusion vector $p^{(h)}$ is iteratively updated as $p^{(h)} = E_q^{\mathsf{T}} E_q\, p^{(h-1)}$. Then, the projection matrix $P_q$ is obtained via:

$$
P_q = \frac{p^{(h)} {p^{(h)}}^{\mathsf{T}}}{\big\|p^{(h)}\big\|_2^2}.
\tag{5.45}
$$

Then, the semantic-diffused embedding is computed with the hyper-parameter $\epsilon \in (0, 1)$ as:

$$
\hat{E}_q = E_q\,(I - \epsilon P_q).
\tag{5.46}
$$

Compared to the unprocessed embedding bag $E_q$, $\hat{E}_q$ presents a diffused semantic structure with a more balanced spectrum (i.e., distribution of singular values) in expectation.
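Putting Eqs. (5.45) and (5.46) together, the update of $p^{(h)}$ is a power iteration that converges to the top right-singular direction of $E_q$, which Eq. (5.46) then shrinks by the factor $1 - \epsilon$. A minimal numpy sketch of the whole diffusion step; the iteration count and the value of $\epsilon$ are assumed for illustration:

```python
import numpy as np

def semantic_diffusion(E_q: np.ndarray, eps: float = 0.5, steps: int = 3) -> np.ndarray:
    """Semantic diffusion of an embedding bag E_q (Eqs. 5.45-5.46)."""
    d = E_q.shape[1]
    p = np.random.randn(d)                # standard normal init, p^(0)
    for _ in range(steps):
        p = E_q.T @ (E_q @ p)             # p^(h) = E_q^T E_q p^(h-1)
        p /= np.linalg.norm(p)            # rescaling; does not change P_q
    P_q = np.outer(p, p) / (p @ p)        # Eq. (5.45): rank-1 projection
    return E_q @ (np.eye(d) - eps * P_q)  # Eq. (5.46)

E_q = np.random.randn(8, 16)              # a toy bag of 8 sixteen-dim embeddings
E_hat = semantic_diffusion(E_q)

# The leading singular value shrinks, balancing the spectrum:
print(np.linalg.svd(E_q, compute_uv=False)[:2])
print(np.linalg.svd(E_hat, compute_uv=False)[:2])
```

Note that the extra normalization of $p$ inside the loop only keeps the iteration numerically stable; since Eq. (5.45) divides by $\|p^{(h)}\|_2^2$, the resulting projection $P_q$ is unaffected.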